Introduction to Data Science with R

Further Tips and Tricks

Ina Bornkessel-Schlesewsky

February 1, 2024

Plan for today

  • Dealing with missing values
  • Helpful functions for recoding data
  • Working with factors
  • Q&A

Missing values

Finding missing values

  • How can you determine whether (and where) your data contain missing values?
  • Let’s return to the penguins data, just for old times’ sake
head(penguins, n=4)
# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>

Finding missing values

  • Using filter() doesn’t work …
penguins |> 
  filter(bill_length_mm == NA)
# A tibble: 0 × 8
# ℹ 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
#   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
#   year <int>
  • Use is.na() instead:
penguins |> 
  filter(is.na(bill_length_mm))
# A tibble: 2 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen             NA            NA                NA          NA
2 Gentoo  Biscoe                NA            NA                NA          NA
# ℹ 2 more variables: sex <fct>, year <int>

Excursus: why a specialised function?

from R4DS Ch. 12:

Missing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown

NA > 5
[1] NA
10 == NA
[1] NA
NA == NA
[1] NA

This is why we need to use is.na() instead.

Identifying rows with no / any missing values

  • The complete.cases() function identifies rows that do not include any missing values:
penguins[complete.cases(penguins),]
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>
  • Use “!” as a negation to identify only the rows that do contain missing values
penguins[!complete.cases(penguins),]
# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           NA            NA                  NA          NA
 2 Adelie  Torgersen           34.1          18.1               193        3475
 3 Adelie  Torgersen           42            20.2               190        4250
 4 Adelie  Torgersen           37.8          17.1               186        3300
 5 Adelie  Torgersen           37.8          17.3               180        3700
 6 Adelie  Dream               37.5          18.9               179        2975
 7 Gentoo  Biscoe              44.5          14.3               216        4100
 8 Gentoo  Biscoe              46.2          14.4               214        4650
 9 Gentoo  Biscoe              47.3          13.8               216        4725
10 Gentoo  Biscoe              44.5          15.7               217        4875
11 Gentoo  Biscoe              NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>

Alternative for implicit missing values: use anti_join()

Example from R4DS Ch. 18.3.3

flights |> 
  distinct(faa = dest) |> 
  anti_join(airports)
# A tibble: 4 × 1
  faa  
  <chr>
1 BQN  
2 SJU  
3 STT  
4 PSE  
  • distinct() finds the unique entries in a column; can also be used without an argument to remove duplicate rows in a data frame
  • anti_join() keeps those rows in the first data frame that don’t have a match in the second, i.e. the destination airports that don’t have an entry in the airports df

Excursus: semi_join()

  • semi_join() is the counterpart of anti_join(): it only keeps rows that have a match in the data frame being joined
flights |> 
  distinct(faa = dest) |> 
  semi_join(airports)
# A tibble: 101 × 1
   faa  
   <chr>
 1 IAH  
 2 MIA  
 3 ATL  
 4 ORD  
 5 FLL  
 6 IAD  
 7 MCO  
 8 PBI  
 9 TPA  
10 LAX  
# ℹ 91 more rows
  • anti_join() and semi_join() are known as filtering joins, contrasting with the mutating joins (left_join() etc.) that we discussed in session 7
  • see R4DS Ch. 19 for more information

Dropping missing values

  • You can drop missing values using the drop_na() function
  • It can be used either to drop all rows with any missing values …
penguins |> 
  drop_na()
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Dropping missing values

  • … or restricted to a particular column (or combination of columns)
penguins |> 
  drop_na(bill_length_mm)
# A tibble: 342 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           34.1          18.1               193        3475
 9 Adelie  Torgersen           42            20.2               190        4250
10 Adelie  Torgersen           37.8          17.1               186        3300
# ℹ 332 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Recoding data: some useful functions

Why recode data?

  • You might be working with a data set where the categories provided are not exactly what you want for your analysis _ In this case, you might want to create new ones …
  • This can be accomplished quite straightforwardly with the case_when() function as part of mutate()

Example

  • Recall the Himalayan Peaks data from Tidy Tuesday
  • One of the data frames listed expedition members
members <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv')

glimpse(members)
Rows: 76,519
Columns: 21
$ expedition_id        <chr> "AMAD78301", "AMAD78301", "AMAD78301", "AMAD78301…
$ member_id            <chr> "AMAD78301-01", "AMAD78301-02", "AMAD78301-03", "…
$ peak_id              <chr> "AMAD", "AMAD", "AMAD", "AMAD", "AMAD", "AMAD", "…
$ peak_name            <chr> "Ama Dablam", "Ama Dablam", "Ama Dablam", "Ama Da…
$ year                 <dbl> 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1978, 1…
$ season               <chr> "Autumn", "Autumn", "Autumn", "Autumn", "Autumn",…
$ sex                  <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",…
$ age                  <dbl> 40, 41, 27, 40, 34, 25, 41, 29, 35, 37, 23, 44, 2…
$ citizenship          <chr> "France", "France", "France", "France", "France",…
$ expedition_role      <chr> "Leader", "Deputy Leader", "Climber", "Exp Doctor…
$ hired                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ highpoint_metres     <dbl> NA, 6000, NA, 6000, NA, 6000, 6000, 6000, NA, 681…
$ success              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ solo                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ oxygen_used          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ died                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ death_cause          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ death_height_metres  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ injured              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ injury_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ injury_height_metres <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Example

  • For those climbers who died, what was the cause of death?
members |> 
  count(death_cause)
# A tibble: 13 × 2
   death_cause                     n
   <chr>                       <int>
 1 AMS                           102
 2 Avalanche                     369
 3 Crevasse                       27
 4 Disappearance (unexplained)    49
 5 Exhaustion                     41
 6 Exposure / frostbite           42
 7 Fall                          331
 8 Falling rock / ice             26
 9 Icefall collapse               16
10 Illness (non-AMS)              60
11 Other                          33
12 Unknown                        10
13 <NA>                        75413
  • How can we create larger groups from these quite specific categories?

Enter case_when()

Simple example: fall vs. all other causes

d_cause <-  members |> 
  drop_na(death_cause) |> 
  mutate(death_cause_grouped = case_when(
    death_cause == "Fall" ~ "Fall",
    TRUE ~ "Other")
  ) |> 
  select(death_cause, death_cause_grouped)

d_cause
# A tibble: 1,106 × 2
   death_cause        death_cause_grouped
   <chr>              <chr>              
 1 Avalanche          Other              
 2 Fall               Fall               
 3 Fall               Fall               
 4 Fall               Fall               
 5 Fall               Fall               
 6 Fall               Fall               
 7 Falling rock / ice Other              
 8 Avalanche          Other              
 9 Avalanche          Other              
10 Avalanche          Other              
# ℹ 1,096 more rows
d_cause |> 
  count(death_cause_grouped)
# A tibble: 2 × 2
  death_cause_grouped     n
  <chr>               <int>
1 Fall                  331
2 Other                 775

case_when(): a more complex example

  • Note how everything that we haven’t captured in our conditions gets turned into an NA; we could keep making the conditions more specific to avoid this
members |> 
  drop_na(death_cause) |> 
  mutate(death_cause_grouped = case_when(
    death_cause == "Fall" ~ "Fall",
    death_cause %in% c("Avalanche", "Falling rock / ice", "Icefall collapse") ~ "Avalanche / falling ice or rock",
    death_cause %in% c("AMS", "Illness (non-AMS") ~ "Illness")
    ) |> 
  select(death_cause, death_cause_grouped) |> 
  count(death_cause_grouped)
# A tibble: 4 × 2
  death_cause_grouped                 n
  <chr>                           <int>
1 Avalanche / falling ice or rock   411
2 Fall                              331
3 Illness                           102
4 <NA>                              262

From numeric to categorial

  • You can use the cut() function in base R to recategorise a numeric variable into intervals
members |> 
  mutate(age_group = cut(age, breaks = c(0,18,40,60,100))) |> 
  select(peak_id:age, age_group)
# A tibble: 76,519 × 7
   peak_id peak_name   year season sex     age age_group
   <chr>   <chr>      <dbl> <chr>  <chr> <dbl> <fct>    
 1 AMAD    Ama Dablam  1978 Autumn M        40 (18,40]  
 2 AMAD    Ama Dablam  1978 Autumn M        41 (40,60]  
 3 AMAD    Ama Dablam  1978 Autumn M        27 (18,40]  
 4 AMAD    Ama Dablam  1978 Autumn M        40 (18,40]  
 5 AMAD    Ama Dablam  1978 Autumn M        34 (18,40]  
 6 AMAD    Ama Dablam  1978 Autumn M        25 (18,40]  
 7 AMAD    Ama Dablam  1978 Autumn M        41 (40,60]  
 8 AMAD    Ama Dablam  1978 Autumn M        29 (18,40]  
 9 AMAD    Ama Dablam  1979 Spring M        35 (18,40]  
10 AMAD    Ama Dablam  1979 Spring M        37 (18,40]  
# ℹ 76,509 more rows

From numeric to categorical

  • Include labels
members |> 
  mutate(age_group = cut(age, breaks = c(0,18,40,60,100),
                         labels = c("minor","young",
                                    "middle-aged", "older"))) |> 
  select(peak_id:age, age_group)
# A tibble: 76,519 × 7
   peak_id peak_name   year season sex     age age_group  
   <chr>   <chr>      <dbl> <chr>  <chr> <dbl> <fct>      
 1 AMAD    Ama Dablam  1978 Autumn M        40 young      
 2 AMAD    Ama Dablam  1978 Autumn M        41 middle-aged
 3 AMAD    Ama Dablam  1978 Autumn M        27 young      
 4 AMAD    Ama Dablam  1978 Autumn M        40 young      
 5 AMAD    Ama Dablam  1978 Autumn M        34 young      
 6 AMAD    Ama Dablam  1978 Autumn M        25 young      
 7 AMAD    Ama Dablam  1978 Autumn M        41 middle-aged
 8 AMAD    Ama Dablam  1978 Autumn M        29 young      
 9 AMAD    Ama Dablam  1979 Spring M        35 young      
10 AMAD    Ama Dablam  1979 Spring M        37 young      
# ℹ 76,509 more rows

From numeric to categorical

members |> 
  mutate(age_group = cut(age, breaks = c(0,18,40,60,100),
                         labels = c("minor","young",
                                    "middle-aged", "older"))) |> 
  count(age_group)
# A tibble: 5 × 2
  age_group       n
  <fct>       <int>
1 minor         277
2 young       47757
3 middle-aged 22988
4 older        2000
5 <NA>         3497

Other cut-type functions

  • the functions cut_interval(), cut_number() and cut_width() are part of the tidyverse and provide some convenient options for creating categorical from numeric data

    • cut_interval() creates n groups of equal range
    • cut_number() creates n groups with an ~ equal number of approximations
    • cut_width() creates groups of width width
  • as for cut(), you can pass labels for the new column via the labels argument if desired

  • try using these to categorise the age data for the members data frame!

Where to from here?

More to explore in R4DS

  • While we’ve covered quite a bit of content here, the R4DS book provides a wealth of additional information that you might like to explore

Interested in modelling?


  • Yep, there’s a book for that too!
  • And, like R4DS, it’s freely available here

Q&A